Predicting Facebook posts' share volume
based on mother page like count and weekday of publishing



Personal Challenge - Andrzej Krasnodebski

4146123

4th Semester ICT & Business Student
Artificial Intelligence Specialization
Fontys University of Applied Sciences


Version History

Version 0.1 (28/02/2022)
- Initial layout (Created)

Version 0.2 (01/03/2022)
- Preface (Created)
- Introduction (Created)
- Domain Understanding (Created)

Version 0.3 (02/03/2022)
- Phase 1 (Created)

Version 0.4 (03/03/2022)
- Phase 2 (Created)

Version 0.5 (05-13/03/2022)
- Phase 2 (Extended)
- Phase 3 (Created)

Version 0.6 (14/03/2022)
- Iteration 0 (Submitted)

Version 1.0 (21/03/2022)
- Iteration 0 (Approved)
- Feedback section (Added)
- Feedback after Iteration 0 (Added)
- Fixed spelling mistakes (Applied to whole document)
- Client's information (Added to Phase 1)
- Client's benefits (Added to Phase 1)
- Interview planning (Added to Phase 1)
- Societal and people impact (Added to Phase 1)
- Domain understanding (Further research)

Version 1.1 (22/03/2022)
- Removed excessive theory (Correlation and STD formulas)
- Figure numbers (Added)
- Conclusion for Iteration 0 (Extended)
- 6.3 Evaluation (Extended)

Version 1.2 (23/03/2022)
- Required data elements, Phase 2 (Extended)
- EDA (Extended)

Version 1.3 (25/03/2022)
- Preprocessing (Extended with outliers and no-outliers sets)

Version 1.4 (28/03/2022)
- Modelling (Added regression, changed kNN)

Version 1.5 (29/03/2022)
- Evaluation (Added regression, updated kNN)

Version 1.6 (30/03/2022)
- Domain Understanding (Extended points 4.1.1-4.1.6)
- Conclusion Iteration 1 (Added)

Version 1.7 (31/03/2022)
- Table of contents (Updated)
- Addressing feedback from Iteration 0

Version 1.7 (01/04/2022)
- Iteration 1 (Submitted)

Version 2.0 (12/04/2022)
- Iteration 1 (Approved)
- Feedback after Iteration 1 (Added)
- Explanation added to points (the "Why?" part): 5.3.4, 5.3.5, 5.3.6, 5.3.7, 5.4
- Phase 3 explanation (Added, Modified): 6.1.1, 6.1.2, 6.3.2

Version 2.1 (13/04/2022)
- Interview with expert (Answers added)
- Iteration 2 section (Created)
- Phase 3 in Iteration 2 (Created)

Version 2.2 (14/04/2022)
- Phase 4 in Iteration 2 (Created)
- Table of contents (Updated)

Version 2.2 (15/04/2022)
- Iteration 2 (Submitted)

Version 3.0 ($$/04/2022)
Version 3.1 ($$/04/2022)
Version 3.2 ($$/04/2022)
Version 3.3 ($$/04/2022)




Table of contents, Iteration 0 & 1

  1. Preface

  2. Introduction

  3. Client
    3.1 Who is my Client?
    3.2 How will my Client benefit from this project?

  4. Proposal (Phase 1)
    4.1 Domain Understanding
     4.1.1 Research methods
     4.1.2 Facebook overview
     4.1.3 Facebook users
     4.1.4 Marketing on Facebook
     4.1.5 Facebook pages and engagement
     4.1.6 Facebook's future
     4.1.7 Interview with domain expert
     4.1.8 What impact does this project have on society and people?
    4.2 Data Sourcing
    4.3 Analytic Approach

  5. Provisioning (Phase 2)
    5.1 Data Requirements
     5.1.1 Domain
     5.1.2 Stakeholders
     5.1.3 Required Data Elements
     5.1.4 Candidate Data Sources
    5.2 Data Collection
    5.3 Data Understanding
     5.3.1 Importing libraries
     5.3.2 Importing data
     5.3.3 Explaining column names
     5.3.4 Computing Summary Statistics
     5.3.5 Visualizing Correlation
     5.3.6 Standardization
     5.3.7 Examining page categories
     5.3.8 Examining post data
    5.4 Data Preparation

  6. Predictions (Phase 3)
    6.1 Preprocessing
     6.1.1 Data Standardization
     6.1.2 Selecting features
     6.1.3 Dividing data into train and test set
     6.1.4 Removing outliers
     6.1.5 Selecting features (with no outliers)
     6.1.6 Dividing data into train and test set (with no outliers)
    6.2 Modelling
     6.2.1 Linear Regression
     6.2.2 k-Nearest Neighbors
    6.3 Evaluation
     6.3.1 Linear Regression
     6.3.2 k-Nearest Neighbors

  7. Conclusion

  8. Appendix

  9. Feedback

  10. References

Table of contents, Iteration 2

Iteration 2

  1. N/A
  2. N/A
  3. N/A
  4. N/A
  5. N/A

  6. Predictions (Phase 3)
    6.1 Preprocessing
     6.1.1 Removing outliers
     6.1.2 Scaling features
     6.1.3 Selecting features
     6.1.4 Dividing data into train and test set
    6.2 Modelling
     6.2.1 Visualization - Linear Regression Prediction Surface
     6.2.2 Visualization - SVR Prediction Surface
    6.3 Evaluation
     6.3.1 Support Vector Machine

  7. Delivery (Phase 4)
    7.1 Model selection
    7.2 Model deployment
    7.3 Application field testing
    7.4 Collecting & Documenting
    7.5 Presentation & Reporting

  8. Conclusion

  9. Feedback

  10. References




1. Preface

My name is Andrzej Krasnodebski and I am a 4th-semester student at Fontys University of Applied Sciences in Eindhoven, Netherlands.
I follow the ICT & Business profile and am currently enrolled in the Artificial Intelligence specialization.

This markdown file/document presents the personal challenge I carry out to demonstrate my learning outcomes. It will be updated as I progress with the course.
The goal is to make a whole Machine Learning project following the IBM Data Science Methodology consisting of 4 main phases:

  1. Proposal (Phase 1)
  2. Provisioning (Phase 2)
  3. Predictions (Phase 3)
  4. Delivery (Phase 4)

Disclaimer:
This file is a complete walkthrough of my work process, with code, visuals, descriptions, analyses and occasionally personal thoughts or comments (please treat them accordingly). I will do my best to create the best model for the given scenario, but I will not promise anything beyond experiments. The idea is to showcase everything that came to mind during the work process, and unless something is a totally misleading educational disaster, I am going to leave it in this report for reference.

Go back to Table of contents.



2. Introduction

To stay connected to the everyday world, people use many different social media platforms. Of all of them, Facebook is one of the biggest and oldest.
Its parent company, Meta, is an American multinational technology conglomerate based in Menlo Park, California, and the parent organization of Facebook, Instagram and WhatsApp, among other subsidiaries. Meta is one of the world's most valuable companies and one of the Big Five American information technology companies, alongside Alphabet, Amazon, Apple, and Microsoft. Founded in 2004 by Mark Zuckerberg, Facebook currently unites over 2.9 billion users.

Project goal:
Predict the number of Facebook post shares based on the page popularity and the weekday of publishing.

Go back to Table of contents.



3. Client

3.1 Who is my Client?

My client is the owner and current administrator of an e-commerce oriented Facebook page with almost 1 million likes. It is the profile of a company offering its services on this market. Because of this sector's specifics, the page is constantly monitored and updated with various content about the company, ongoing projects, job opportunities and e-commerce nuances. As a firm, they excel in digital marketing, web design and supplier selection, which secures their position on the market.

For privacy reasons, the company asked not to mention their name, Facebook page and logo until the delivery of the final product.

One of their goals is to constantly improve the quality of the services offered to new and current clients. In order to do that, they want to be able to predict a Facebook post's share count to see whether the information, shared by them or by their client, is likely to spread across users and is therefore interesting to them. The two factors they want to take into account for this prediction are the mother page like volume and the weekday of publishing.

The company reached out to me on LinkedIn asking to take up this challenge as they believe it will be a superb learning opportunity for me and a convenient financial option for them.

3.2 How will my Client benefit from this project?

An algorithm predicting post share count will improve and extend the services offered by my client.

Go back to Table of contents.



4. Proposal (Phase 1)

The first phase of the project focuses on researching the chosen domain and gaining a better understanding of the topic.
Afterwards, it moves into Data Sourcing and the search for data enrichment.
It finishes with defining a clear goal for the modelling and choosing the best approach to the project.

The domain I will be researching is Facebook posts and pages: how they work and what the behavioral factors are.

This domain is part of two bigger ones:
Social Media -> Facebook -> Posts and Pages

To structure my research, I have come up with research questions that I will try to answer with my findings.

Main Research Question:

RQ: How can posts' share volume be predicted based on mother page like count and weekday of posting?

Sub Questions:

SQ1. How does Facebook position posts on the main page?
SQ2. Do the weekday and time of publishing influence posts' reach?
SQ3. Which days are the best to post on in order to achieve the most post shares?
SQ4. What factors influence post share volume the most?
SQ5. Can a Machine Learning algorithm predict posts' share volume based on the selected features?

4.1 Domain Understanding

This section is intended to document my findings about the domain I am researching and is crucial for every AI project. I will try to collect as much useful information as possible and, based on that, draw meaningful conclusions later in the project. As Facebook is something I use on a weekly basis, I am confident in what I already know and treat my knowledge as a solid base for this exploratory research.
Social media is a very broad topic, with new insights appearing every day. Because of this, researching one of the biggest platforms might be overwhelming. However, it also means that there are a lot of materials easily available, and finding suitable ones will not be a problem. The goal of this phase is to broaden and organize my knowledge about this domain, and to make sure I do it properly I follow a research pattern on which I elaborate below.

4.1.1 Research methods

There are many ICT research methods available, and in a perfect scenario I would use them all and gain experience with every method. However, not every research method is suitable for every problem, and a good researcher chooses the ones that suit it the most. I want to be a good researcher and at the same time keep this section concise and valuable.

Consulting the ICT research methods helped me pick the best combination for my project.

Research methods I will be using:

4.1.2 Facebook overview

I start the Domain Understanding with quick research into the factors that influence a post's popularity, i.e. its number of shares. The exact Facebook algorithms responsible for post positioning on the main page are kept secret and can only be surmised.

You can find the results of my findings on the graphic below.

Every publicly available source I use is referenced here.

Fig.1 - Factors influencing posts' popularity.

There are many factors that influence post popularity. I believe most of them are publicly known, but some are yet to be discovered by the public.

4.1.3 Facebook users

The exact number of active Facebook users is unknown; however, estimates indicate around 2.9 billion monthly active users (Q4 2021).

"The platform surpassed two billion active users in the second quarter of 2017, taking just over 13 years to reach this milestone. In comparison, Meta-owned Instagram took 11.2 years, and Google’s YouTube took just over 14 years to achieve this landmark. As of October 2021, Facebook’s leading audience base was in India, with almost 350 million users whilst the United States ranked second with an approximate total of 193 million users. The platform also finds remarkable popularity in Indonesia and Brazil, with well over 100 million users in both countries."

- According to Statista.com

[Figure: facebook-marketing-statistics-graph-1.png]

The screenshots above, whose source I reference here, contain a lot of useful information regarding my project.
Over three-quarters of Facebook users visit the site daily, which results in huge engagement on profiles every day.
What is interesting, and gives high hopes, is that sharing with many people at once is the top reason why men use Facebook and the second reason for women.

Let's look at the age of the users.

On the interactive chart which I reference here, I can see that until 2015 the 18-29 group had the highest percentage of social media usage. From 2015 the 30-49 age group joined them, and from 2017/2018 the values for all three age groups are in close proximity. This leads to the conclusion that each year social media and technology become more accessible to older people.
However, there is one factor that scales this effect.
People from the 18-29 age group who were at the 'peak' in 2009/2010 are, ten years later in 2019/2020, still using social media, but now fall into the older 30-49 group. The same repeats for the other age groups.

4.1.4 Marketing on Facebook

Facebook's original intention was to be a social network for college students; at one time it even required an .edu email address for registration.
Nowadays, Facebook business profiles are one of the most effective marketing channels for their owners and help reach many more clients at a very decent price compared to the results.
According to Statista.com, 77% of Internet users are active on at least one Meta platform (Facebook is owned by Meta and is its largest platform), giving business owners an incredible opportunity to reach an enormous number of potential new clients, all on one platform.

The company makes sure that setting up a business profile and building a firm's image is easy, convenient and free of charge for its users, making Facebook's services a no-brainer in the marketing niche.
From the graph below (source: Statista.com) we can see that users spend on average 33 minutes a day on Facebook, which easily makes the 'marketer math':

the most users + the most time spent = the most opportunity for marketing.


Facebook has developed and currently offers special features for marketers and business profiles, making advertising even more beneficial and customizable.
Facebook Ads, which is the name of this service, provides its users with full configuration of a marketing campaign. The number of options for ad optimization is endless and will most certainly suit everyone interested. While over 50% of consumers want to discover new products through Facebook Stories, its ads reach 34.1% of the global population over age 13.

This is only a very brief overview of marketing on Facebook. However, in my opinion it is enough to realize the enormous potential of this service and start to understand how it works.

[Figure: facebook-statistics-10]

4.1.5 Facebook pages and engagement

A Facebook page enables businesses, brands, celebrities, initiatives and organizations to reach their audience free of charge. Facebook profiles can be private while pages are public. Google can index a page, which will make it easier to find. You can operate your Facebook page and platforms such as Facebook Business Suite and Creator Studio on your desktop and mobile device.

A Facebook page allows its owner to promote the company and keep in touch with users. The engagement indicator shows how many people were influenced by ads on the page and its posts. Thanks to this, the owner can assess how well the ads match the audience. Page activity takes into account interactions with the Facebook page and its posts influenced by ads. On-page activity can include things like liking, marking a post with a 'Super' reaction, checking in to a location, clicking a link, and so on.

Advantages of pages:

The most popular page functions:

4.1.6 Facebook's future

Those are the questions many wish to have an answer to.

History continues to amaze us with its irony. The fact is that although Facebook has been successful in making the lives of billions of people public, user privacy is its future.

"The sources of change in the way people communicate are instant messaging, small communities and ephemeral content," said Mark Zuckerberg to shareholders during the last meeting on the periodic discussion of financial results for the fourth quarter of 2019.

For this reason, WhatsApp, Instagram and Messenger are becoming the main driving force behind the development of Facebook. The best evidence is the company's performance data and a look at which way money is starting to flow from the advertisers themselves.

Let's look at potential new sources from which Facebook can derive new and greater profits. Even though many of us still log on to Facebook every day, the iconic blue news feed isn't as eye-catching as it used to be. The attention of users, especially the young ones, is shifting to Stories. Research shows that this Instagram format generates 15 times more engagement than any other place in the Facebook ecosystem.
This is also confirmed by marketing budgets. About 98% of Facebook's advertising revenue in the last quarter came from mobile devices. This should come as no surprise to anyone. GlobalWebIndex data shows that mobile phone traffic currently accounts for more than half of the time we spend online, and we spend half of our time there on social media.


It's hard to say exactly what the future of Facebook will be like, but the company seems to be more fortunate than smart. Privacy has become a valuable commodity. Now the company can start making money on the fact that it gives us a substitute for what it has taken away from us to a significant extent. However, there is something else worth paying attention to.

Over the last few years, Facebook has tried to impose its vision of the Internet on its users, which resulted in a drop in engagement. Recently, there has been a retreat from such activities and the company is trying to follow people, an example of which is the development of groups that came from users. Business will benefit if it follows the same path, i.e. puts people at the center of its attention. We all benefit from this.

4.1.7 Interview with domain expert

To further investigate the domain I will conduct an interview with an expert in this field and present my findings here. Beforehand I am going to prepare the questions and topics I wish to cover during the meeting, and I plan to moderate the discussion.

Interview with Natalia Nadolska & Karolina Dlubek marketing specialists at Digital Care group.

Click to expand the interview details.
#### Interview questions:
- Q.1 - In what industry do you operate and what are its specifics?
I am responsible for social media management at a company that is an Authorized Reseller and Service Provider of Apple equipment. We offer services to both B2C and B2B clients. Our customers can come directly to our stores or get to know our whole offer online, through our website or Instagram.

- Q.2 - Do you believe that a Facebook page is still worth the attention of business owners?
I believe that Facebook as a marketing tool still has an important function and is worth our attention. It lets your customers stay connected with your brand, participate in conversations and read the latest news.

- Q.3 - What are the advantages of marketing on Facebook?
Apart from those mentioned before, Facebook has a giant audience of over a billion users, which means your target group has to be somewhere among them. Facebook tools can help you find and target them. As a business page owner, you also have access to Facebook Insights, which allows you to see what posts your audience prefers as well as their specific demographic information, which is crucial when creating a successful marketing strategy.

- Q.4 - Do you believe that some weekdays are better for post sharing to get the biggest reach?
Well, I believe there is no rule that is right for all industries. It's hard to say that a post shared on Monday will always get more engagement than one shared on Wednesday. It is important to publish posts regularly to build and engage our audience. It's essential to keep an eye on your social media insights, and it's very likely that you'll be able to figure out which days are best for publishing to your specific audience.

- Q.5 - What is more important page like count or page content reach?
I would say that content reach is more important for business owners. A wide reach of content can bring you potential customers who may buy your products without liking your page. The page like count is not always an indicator of customer engagement, number of sales or loyal customers.

- Q.6 - How do you measure content reach and what does it tell you?
Well, it's pretty simple - when it comes to social media, we have all the statistics. We can see the reach of each post, we can see how many people have accessed it, we can see the level of engagement, the number of likes and comments. This allows us to know what type of posts our customers prefer, what to publish to get them to interact or what is the best way to create a conversation.

- Q.7 - What do you think influences the posts' reach the most?
Undoubtedly, in order to attract any kind of customer interest our post must be 1 - visually appealing 2 - it must carry some value. When a post is of interest to a few people, its reach is also affected by the level of user engagement, each Like and comment increases our chances of reaching a greater audience.

- Q.8 - Are there any trends in Facebook marketing?
Some examples that come to mind: chatbots - a very easy and convenient way for customers to interact with your brand, allowing them to ask questions at any time. Next, the ability to make purchases through Facebook - it's now possible to highlight products in photos and posts you share so your audience can easily identify and purchase them. Facebook stories - a great way to interact with your customers, you can share an add about your product or a video of you speaking - this helps build relationships with your consumers and allows them to get to know you better.

- Q.9 - Do you think allocating funds for page development is worth it or it is better to allocate them elsewhere?
Yeah, I think it is. I think every business owner should know at least a little bit about Facebook capabilities like the ones we talked about today. Understanding that and knowing their target audience it should help them make a decision. As I said before, I think Facebook is a powerful tool if used properly. Also, we don't have to spend a huge amount of money on Facebook to grow our page, create great audience and engagement.

- Q.10 - What is facebook's position on current social media market?
I know it may seem like Facebook is going away into oblivion. People often think of it as a platform for older people, especially when compared to Instagram or TikTok. Contrary to these well thought out theories, Facebook's position in the market is still very strong and it is a great marketing tool, as I mentioned earlier, it has many advantages and a huge audience. It is true that the vast majority of Facebook users are over the age of 20, so perhaps when it comes to targeting children, it makes more sense to use other platforms.

- Q.11 - Many people say that nowadays Facebook is for old people, what is your position on this?
Well, just as I answered before – it is not entirely true. If we take a look at statistics of the US Facebook users 17.4% of them are between 18-24 and another 25.4% are those between 25-34.

- Q.12 - Do you think that predicting the posts share count based on mother page like volume and weekday of publishing using AI might be of any use?
I am not really into AI, though I'd like to be. I think yes, it might help with planning ads.


4.1.8 What impact does this project have on society and people?

Besides the value that this project brings to the client, there is also the impact it has on society and people.

Besides the scientific use this project is of no interest to individual people and has no direct impact on them. Of course, it may be used as a learning or research resource.

However, it might impact society indirectly.
As Facebook and Facebook marketing are targeted at selected groups in society, using this project's deliverable for those purposes will affect them. For now I cannot think of any negative impact this project could have, as it is intended mostly for research purposes. This niche is unlikely to affect society in a harmful way, not physically at least. However, it is all about the usage: fake news and hate speech are just two of the many negative outcomes of social media abuse, and despite being quite extreme situations, I have to take them into account.

4.2 Data Sourcing

Click to expand data sourcing.

To successfully pursue this project I need a solid database with quality data. For the purposes of the ML model I need at least 3 features:
- Number of posts' share count - the Target Variable
- Number of mother page like count
- Weekday of post sharing
Of course, that will not be enough for the domain understanding and a deep EDA (Exploratory Data Analysis). I am going to look for many more features to enrich my data:
- Number of words in the post
- Mother page category
- Number of posts' comments
- If the post was promoted
- etc.
In the search for data suitable for this project I am going to:
1. Search for existing data
2. Search for data APIs
3. Search for web data to scrape

Data Sources I managed to find so far:
* Facebook Comment Volume Dataset - looks very promising, many features to support modelling and EDA
* Online News Popularity Data Set - not related to Facebook, but to the Social Media domain
* Removed Facebook Pages: Engagement Metrics and Posts - relation with post shares
* Facebook Data - not related to shares, info about users
* Facebook various data sets - for the domain understanding

4.3 Analytic Approach

Go back to Table of contents.



5. Provisioning (Phase 2)

The second phase of the project is focused on data.
It starts with the more theoretical aspects, then moves into data understanding and preparation for modelling.

5.1 Data Requirements

5.1.1 Domain

After Phase 1, I already know which domains I need the data from.

  1. Facebook domain - for domain understanding
  2. Posts and Pages sub-domain - for modelling

Domain 2 is related to domain 1 as it is a sub-domain. The Facebook domain is especially needed to fully understand the main one used for modelling and EDA.

5.1.2 Stakeholders

Storing this kind of data is beneficial for many stakeholders in and outside those domains.
Facebook is one of the largest marketing playgrounds and gathers many companies interested in advertising there. Also, page administrators and owners want to have
insights into their products' performance. On top of that, the Facebook company aims to deliver its customers the highest value and the best experience on this platform.
To do so, they need to constantly improve their services, and to do that, they first need to measure them.

5.1.3 Required Data Elements

The data required for the model consists of 2 facts and 1 dimension:
Target_Post_Share_Count and Page_Popularity are facts, and Wday_Publishing, as a time-related feature, is a dimension.

5.1.4 Candidate Data Sources

Candidate data sources should be identified based on the facts.
The candidate data sources containing the identified data elements should be reviewed on their collection of data facts. If these facts make sense with regard to the information they contain, I may continue by focusing on the dimensions.

I have already listed several candidate data sources here.

5.2 Data Collection

Click to expand data collection.
- Information
The information I want to collect is already described in the [Data Sourcing](#DataSourcing) section.
I will not be looking at any specific time frame as my target variable is not time sensitive in terms of years or seasons.

- Storage
At this (initial) stage of the project I will be storing the data on my hard drive which will allow me to access it easily and fast.
Perhaps, when the project advances I will be forced to use cloud storage for efficiency.

- Version and Naming system
The version and naming system will be established while exporting the data and loading it into the environment.

- Data Reloading
As I will be using a publicly available dataset, only one copy is going to be downloaded and, until an update, it will not be reloaded.
However, if I end up using any API or web scraping method, the data will be updated weekly.

- Extraction procedures
The dataset is downloadable in CSV format from the UCI Machine Learning Repository.

5.3 Data Understanding

In order to start working with data, I need to be able to understand the data. Data is usually presented in various forms:
initially it is often presented in tabular form, but to get a better view on the data it is common to visualize data in graphs, such as a line graph, histogram or pie chart.

This part answers the following questions:

5.3.1 Importing libraries

5.3.2 Importing data

Assigning and changing column names

Apparently there is 1 NaN row at the end that I have to get rid of.
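
For reference, a minimal sketch of how this import and clean-up could look. The file name `Features_Variant_1.csv` and the variable name `df` are assumptions; the full list of 54 column names from section 5.3.3 is not repeated here.

```python
import pandas as pd

# Load the UCI Facebook Comment Volume data (file name assumed here);
# the CSV ships without a header row
df = pd.read_csv("Features_Variant_1.csv", header=None)

# column_names would hold the 54 names listed in section 5.3.3
# df.columns = column_names

# Drop the single NaN row found at the end of the file
df = df.dropna().reset_index(drop=True)
df.info()
```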

5.3.3 Explaining column names

Click to expand the column names explanation. 1 Page_Popularity(likes) - Defines the popularity or support for the source of the document.

2 Page_Checkins - Describes how many individuals have visited this place so far. This feature is only associated with places, e.g. an institution, theater, etc.

3 Page_Talking_About - Defines the daily interest of individuals towards the source of the document/post, i.e. the people who actually come back to the page after liking it. This includes activities such as comments, likes on a post, shares, etc. by visitors to the page.

4 Page_Category - Defines the category of the source of the document eg: place, institution, brand etc.

5 - 29 These features are aggregated by page, by calculating min, max, average, median and standard deviation of essential features.

30 CC1 - The total number of comments before selected base date/time.

31 CC2 - The number of comments in last 24 hours, relative to base date/time.

32 CC3- The number of comments in last 48 to last 24 hours relative to base date/time.

33 CC4 - The number of comments in the first 24 hours after the publication of post but before base date/time.

34 CC5- The difference between CC2 and CC3.

35 Base time - Decimal(0-71) Encoding, Selected time in order to simulate the scenario.

36 Post length - Character count in the post.

37 Target_Post_Share_Count - This feature counts the number of shares of the post, i.e. how many people have shared this post onto their timeline.

38 Post Promotion Status - To reach more people with posts in the News Feed, individuals promote their posts; this feature tells whether the post was promoted (1) or not (0).

39 H_Local - Decimal(0-23) Encoding, This describes the H hrs, for which we have the target variable/ comments received.

40 - 46 Post published weekday - This represents the day(Sunday...Saturday) on which the post was published.

47 - 53 Weekdays feature - This represents the day(Sunday...Saturday) on selected base Date/Time.

54 Nr_Comments - The number of comments in the next H hours (H is given in feature no. 39).

General information about the database

The .info() function gives a fast and short overview of the dataframe.
There are 40,949 data rows overall, which gives high hopes for the model: a lot of data to train on.
The dataset contains a lot of features, some more and some less relevant to this project.
At first glance, I see no missing values which is a good indicator.

5.3.4 Computing Summary Statistics

Why? : Computing Summary Statistics is an essential step in the beginning of every data analysis and gives a lot of initial insights about data I am working with. Already at this stage I get to know about many computations (explained below) and can draw conclusions about the dataset.
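
A minimal sketch of the computation behind the numbers discussed below, assuming `df` and the renamed columns from sections 5.3.2-5.3.3:

```python
# Summary statistics for the columns discussed below
cols = ["Page_Popularity", "Page_Checkins", "Page_Talking_About",
        "Post_Length", "Target_Post_Share_Count", "Nr_Comments"]
print(df[cols].describe().round(2))
```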

Page_Popularity(likes) - a total of 40,949 values ranging from 36 to 486,972,297 with a mean of 1,313,814; 25% of observations are below 36,734 likes.
Note that a page's like number may appear more than once in the dataset, as it sometimes contains data from more than one post on the same page.

Page_Checkins - the total number of values is the same, with a min of 0 and a max of 186,370, as this feature is applicable only to pages of places in which users may check in.

Page_Talking_About - a mean of 44,800 indicates that, on average, people usually return to a previously liked page and perhaps are interested in its content.

Page_Category - Categorical variable indicating the category of the page.

Post_Length - On average a post's length equals 163 characters, with a min of 0 (perhaps a photo) and a max of 21,480, which is a very long post.

Target_Post_Share_Count - A lot of values for the target variable, with a mean of 117, which is relatively high, a min of 0 and a max of 144,860, which is enormous.
However, the 75th percentile is 61, which indicates that the values lie much closer to the min than to the max.
What worries me is the high standard deviation, which is over 8 times higher than the mean.

Nr_Comments - the mean number of comments is rather low, taking into account the enormous page like and share counts.

5.3.5 Visualizing Correlation on a heat map

Why? : A heat map is a convenient way to visualize the correlation of selected variables from the dataset. The output is a clear overview shown by color density. This is a rather early stage, but it can already indicate the direction of model accuracy. I use it not only to check the correlation with my target variable but also to research some other dependencies.
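
A sketch of how such a heat map can be produced with seaborn, reusing the assumed `df` and `cols` names from the earlier sketches:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heat map for the features I focus on
cols = ["Page_Popularity", "Page_Checkins", "Page_Talking_About",
        "Post_Length", "Target_Post_Share_Count", "Nr_Comments"]
plt.figure(figsize=(8, 6))
sns.heatmap(df[cols].corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heat map")
plt.show()
```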

The graphic above gives me a lot of insight into the correlation of the features.

The correlation between the target variable and the likes feature equals 0.33, which is a weak positive correlation. It is caused by the high standard deviation
of both features and many big outliers.

In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data.
In the broadest sense correlation is any statistical association, though it actually refers to the degree to which a pair of variables are linearly related.

Standard deviation is a measure of the amount of variation or dispersion of a set of values.
A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set,
while a high standard deviation indicates that the values are spread out over a wider range.

Perhaps, after removing the outliers and normalizing the data the correlation will improve.
For now I will continue with EDA by plotting the correlation.

All the values are really big, and to get any meaningful insights I am going to have to rescale them.

The two most discussed scaling methods are normalization and standardization. Normalization typically means rescaling the values into a range of [0,1]. Standardization typically means rescaling data to have a mean of 0 and a standard deviation of 1 (unit variance).

5.3.6 Standardization

Why? : "Data standardization is about making sure that data is internally consistent; that is, each data type has the same content and format. Standardized values are useful for tracking data that isn't easy to compare otherwise." I want to compare data that is on many different scales and would otherwise be useless to compare. That is why I standardize it.
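
A minimal sketch of this step with sklearn's StandardScaler, reusing the assumed `df` and column names from earlier; the result is kept in a separate frame so the raw data stays intact:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

cols = ["Page_Popularity", "Page_Checkins", "Page_Talking_About",
        "Post_Length", "Target_Post_Share_Count", "Nr_Comments"]

# Rescale each column to mean 0 and standard deviation 1
scaler = StandardScaler()
df_std = pd.DataFrame(scaler.fit_transform(df[cols]), columns=cols)
print(df_std.describe().round(2))
```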

After the standardization I want to check the correlation once again, first on the heat map and then on a regular scatterplot.
Note: This is just a try-out and I am almost sure this will not change anything on the heatmap.

As I thought, no change here.
In a later phase I will remove the outliers and check the correlation once again.

I want to look at the heatmap one more time, maybe I find some better correlations.

There is a high correlation between 'Page_Popularity' and 'Page_Talking_About', which I might want to check later. Let's look at the scatterplot.

Similar to the initial plot. Standardizing does not change the distribution, it only scales the numbers. That is why the plots look identical, just with different scales.

5.3.7 Examining page categories

Why? : My target variable is highly dependent on the mother page, and to fully understand pages I want to dive into their categories to get some more insights about the domain and the data.
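
A sketch of how the category distribution plot could be produced. Note that the category ids are numeric in the raw data; the mapping to readable names (Product/Service, Public figure, etc.) comes from the dataset description and is not shown here.

```python
import matplotlib.pyplot as plt

# Count how many rows fall into each page category and plot the top 10
category_counts = df["Page_Category"].value_counts().head(10)
category_counts.plot(kind="bar", figsize=(10, 4), color="steelblue")
plt.xlabel("Page category (encoded id)")
plt.ylabel("Number of posts")
plt.title("Most frequent page categories")
plt.tight_layout()
plt.show()
```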

The histogram above illustrates the distribution of page categories. From the dataset description I loaded the category names into the environment.
I can see that Product/Service pages are the most present in the data with over 7,494 pages. Public figure and retail pages follow very closely, still on the podium with 4,511 and 4,301 registered pages.

5.3.8 Examining post data

The bar chart above visualizes the tendency of posts to get re-shared by users. From the top plot, I can see that Thursday and Sunday are the weekdays with the highest rate of sharing by users. To confirm that, I compare this graph with the total number of posts published by pages, as the high mean might be caused by a high number of 'initial' shares. However, the bottom plot proves the opposite: apparently Thursdays and especially Sundays are the days with the smallest number of new posts.
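
A sketch of how both panels could be built from the raw one-hot weekday indicators. The indicator column names (`Post_Sun` ... `Post_Sat`) are assumptions; the actual merge into a single column happens in section 5.4.

```python
import matplotlib.pyplot as plt

# Assumed names of the seven post-published-weekday indicator columns
wday_cols = ["Post_Sun", "Post_Mon", "Post_Tue", "Post_Wed",
             "Post_Thu", "Post_Fri", "Post_Sat"]

# Derive a single weekday label from the 0/1 indicators
wday = df[wday_cols].idxmax(axis=1)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
# Top: mean share count per publishing weekday
df.groupby(wday)["Target_Post_Share_Count"].mean().loc[wday_cols].plot(
    kind="bar", ax=ax1, color="seagreen", title="Mean shares per weekday")
# Bottom: number of new posts per publishing weekday
wday.value_counts().loc[wday_cols].plot(
    kind="bar", ax=ax2, color="gray", title="Number of new posts per weekday")
plt.tight_layout()
plt.show()
```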

Perhaps, on those days with a relatively low amount of new information, users get a chance to focus their attention on what is available and are more likely to pass it forward.

Similarly to the previous visualization, I want to discover the number of comments that appear under a post, in this case within 24 hours after publishing. The mean values are close this time, with a slight advantage for Wednesday and, again, Sunday. I double-check the result by plotting the total number of comments grouped by weekday to make sure my conclusion is valid.

Interestingly, Sunday has both the highest mean number of comments written and the lowest total number of them, which indicates that the high mean is not caused by a huge number of observations.

Linking the observations from the previous visualization and this one, I can certainly see a connection between the high re-share numbers and the large number of comments on Sunday. Perhaps a post with a lot of comments is promoted by Facebook algorithms and reaches more users, which accelerates this process. However, it could work the other way around, where a lot of people re-share the post and that is the reason for the outstanding number of comments. Nevertheless, both features influence or correlate with each other.

Page_Talking_About defines the daily interest of individuals towards the source of the document/post: people who actually come back to the page after liking it. This includes activities such as comments, likes on a post, shares, etc. by visitors to the page.
The bar chart above indicates that Wednesday has the most engagement through the week, similarly to the number of comments and new posts that are being published.

However, this time it is most likely caused by the high total engagement recorded in my data. For this reason I want to check the median value.

And again, Sunday! The median (10,938) is almost 4 times smaller than the mean (47,140), which tells me that the data is far from evenly distributed and is heavily skewed.

The median represents the 50th percentile of a dataset. That is, exactly half of the values in the dataset are larger than the median and half of the values are lower.
Also, it is an important metric to calculate because it gives us an idea of where the “center” of a dataset is located. It also gives us an idea of the “typical” value in a given dataset.

5.4 Data Preparation

Why? : I have already made some adjustments to the initial dataset to even start the project and I think the data is pretty clean. However, before moving to the modelling phase I need to make sure some essential conditions are in place. Data preparation ensures analysts trust, understand, and ask better questions of their data, making their analyses more accurate and meaningful. From more meaningful data analysis comes better insights and, of course, better outcomes.

Check for duplicated values

There are 8 duplicated rows in the data overall, which is a small number. Their influence on the whole dataset is negligible.
However, I will get rid of the duplicates anyway.
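
A minimal sketch of this check and clean-up, assuming the `df` frame from earlier:

```python
# Count fully duplicated rows and drop them
print("Duplicated rows:", df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)
```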

Check for missing values

In order to avoid bias I need to fill all null values in the dataset. I start by counting the null values for each column.

Click to read more about bias in Machine Learning. "Bias is a phenomenon that skews the result of an algorithm in favor or against an idea.
Bias is considered a systematic error that occurs in the machine learning model itself due to incorrect assumptions in the ML process.
Technically, we can define bias as the error between average model prediction and the ground truth. Moreover, it describes how well the model matches the training data set:
A model with a higher bias would not match the data set closely.
A low bias model will closely match the training data set."
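
A sketch of the null-value count, again assuming the `df` frame from earlier:

```python
# Number of missing values per column (only columns that have any)
nulls = df.isnull().sum()
print(nulls[nulls > 0] if nulls.any() else "No missing values found")
```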

Fortunately, the dataset is complete and does not require any data to be filled up.

Aggregate weekday of publishing into one column

Currently, the weekday of publishing is split into 7 columns with 0/1 indicators. To use this feature as a single variable for modelling, I need to merge it into one column.
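
A sketch of that merge. The indicator column names are assumptions; `Wday_Publishing` and `Wday_Nr` are the merged columns referred to elsewhere in this document.

```python
# Assumed names of the seven 0/1 weekday-of-publishing columns
wday_cols = ["Post_Sun", "Post_Mon", "Post_Tue", "Post_Wed",
             "Post_Thu", "Post_Fri", "Post_Sat"]

# Single label column: the name of the indicator column that equals 1
df["Wday_Publishing"] = df[wday_cols].idxmax(axis=1)

# Numeric version (0 = Sunday ... 6 = Saturday) for the models
df["Wday_Nr"] = df[wday_cols].values.argmax(axis=1)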

Check data types

The data types of the interesting columns are correct. The interesting columns are the ones needed for modelling and for continuing the EDA.
To save space in the document I only print the chosen features' types.

That is where my initial EDA ends; I will be returning to it and improving it as the project moves forward.


Go back to Table of contents.



6. Predictions (Phase 3)

The 3rd phase covers model training and evaluation.
In the 1st iteration I will be using two algorithms to check which one solves my problem better:
kNN, which is used for both classification and regression, and linear regression. Perhaps in later iterations I will also use different algorithms, but for now I want to focus on these two.

K-Nearest Neighbors

Click to read more about this algorithm. "In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression. In both cases, the input consists of the k closest training examples in a data set. The output depends on whether k-NN is used for classification or regression: In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. In k-NN regression, the output is the property value for the object. This value is the average of the values of k nearest neighbors. k-NN is a type of classification where the function is only approximated locally and all computation is deferred until function evaluation. Since this algorithm relies on distance for classification, if the features represent different physical units or come in vastly different scales then normalizing the training data can improve its accuracy dramatically."

6.1 Preprocessing

6.1.1 Data Standardization

I start with standardizing the values as it is necessary for the algorithms. This process allows me to compare scores between different types of variables.

6.1.2 Selecting features

In this step I choose the features I will need for modelling. These features were already chosen at the beginning of this project and are as follows:

X - the popularity of the mother page and the weekday of publishing
Y - the target variable the model will predict, the number of post shares

These features were chosen during the domain understanding because, according to my research, they are dependent on each other.

6.1.3 Dividing data into a training and test set

I am using the train_test_split() function to divide the dataset into training and testing parts. I choose a test size of 0.2, as the correlation is pretty low
and I want to dedicate more data to training.
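
A minimal sketch of this split. The column names follow the earlier sections, `random_state` is an assumption for reproducibility, and for brevity the sketch splits the raw feature columns (in the notebook the standardized values from 6.1.1 are used).

```python
from sklearn.model_selection import train_test_split

# Features chosen in 6.1.2 and the target variable
X = df[["Page_Popularity", "Wday_Nr"]]
y = df["Target_Post_Share_Count"]

# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```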

6.1.4 Removing outliers

I remove the outliers to keep clean data without extreme deviations from the majority.

In the feedback after Iteration 0 I got a hint to try modelling without removing outliers, which is why I will be computing everything both with and without outliers.

The script above prints the max, mean, min and std of the variables, then removes the outliers and prints the summary one more time to compare the output.
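
For reference, a sketch of what such an outlier-removal script could look like. The z-score rule with a cut-off of 3 standard deviations is an assumption (the report does not state which rule is used); the '_o' suffix follows the naming from 6.1.6.

```python
import numpy as np

def remove_outliers(frame, columns, z_max=3.0):
    """Keep only rows whose values in the given columns lie within
    z_max standard deviations of the column mean (z-score rule)."""
    keep = np.ones(len(frame), dtype=bool)
    for col in columns:
        z = (frame[col] - frame[col].mean()) / frame[col].std()
        keep &= (z.abs() <= z_max).to_numpy()
    return frame[keep]

cols = ["Page_Popularity", "Target_Post_Share_Count"]
print(df[cols].agg(["min", "mean", "max", "std"]).round(2))
df_o = remove_outliers(df, cols)   # '_o' marks the no-outliers set
print(df_o[cols].agg(["min", "mean", "max", "std"]).round(2))
```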

6.1.5 Selecting features with no outliers

The plot looks much better now. The values are scaled and contain no outliers.

6.1.6 Dividing data into train and test set with no outliers

To easily toggle between the feature sets I simply add an '_o' suffix to the variable names.

6.2 Modelling

6.2.1 Linear Regression

I start modeling with linear regression. To have a clear comparison I will create 2 models and print both results.
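
A minimal sketch of the two fits, assuming the train/test splits from 6.1.3 and 6.1.6 (the '_o' variables denote the no-outliers set):

```python
from sklearn.linear_model import LinearRegression

# One model on the original split, one on the split without outliers
lr = LinearRegression().fit(X_train, y_train)
lr_o = LinearRegression().fit(X_train_o, y_train_o)

for name, model in [("with outliers", lr), ("no outliers", lr_o)]:
    print(name, "- coefficients:", model.coef_.round(4),
          "intercept:", round(model.intercept_, 4))
```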

Both models are similar in both slope and intercept.

6.2.2 k-Nearest Neighbors

For kNN I will be using KNeighborsRegressor(), the regression equivalent of KNeighborsClassifier.
To find the K value that results in the lowest RMSE, I will check all options in the range 1-20, both for the 'initial' features and for the 'no outliers' features.
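
A sketch of that search loop, shown here only on the no-outliers split (the same loop is repeated for the initial split); variable names follow the earlier sketches.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

rmse_values = []
for k in range(1, 21):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train_o, y_train_o)
    pred = knn.predict(X_test_o)
    rmse = np.sqrt(mean_squared_error(y_test_o, pred))
    rmse_values.append(rmse)
    print(f"k={k:2d}  RMSE={rmse:.2f}")
```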

Root Mean Square Error (RMSE) is a standard way to measure the error of a model in predicting quantitative data.
I believe the best K value is somewhere around 10. This time the no-outliers data results in a significantly better RMSE value. I think it will be clearly visible on the plot.

Both plots illustrate the same thing: the influence of the K value on the RMSE. While the upper plot shows no clear pattern, the bottom one is easily readable. The RMSE decreases as K increases, with a local minimum around 7-10. To confirm the best K value I can use the GridSearchCV feature from sklearn.
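
A minimal sketch of such a grid search; the scoring choice and the 5-fold cross-validation are assumptions.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

param_grid = {"n_neighbors": list(range(1, 21))}
grid = GridSearchCV(KNeighborsRegressor(), param_grid,
                    scoring="neg_root_mean_squared_error", cv=5)
grid.fit(X_train_o, y_train_o)
print("Best k:", grid.best_params_["n_neighbors"])
```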

The code above takes some time to run, as it has to iterate over all the values I specify; the more values, the longer it takes. Finally, I can create the model with the best K value now known.

While creating the model I set n_neighbors of KNeighborsRegressor() to 10.

Here are some things to keep in mind:
As we decrease the value of K to 1, our predictions become less stable.
Inversely, as we increase the value of K, our predictions become more stable due to majority voting / averaging,
and thus, more likely to make more accurate predictions (up to a certain point).
Eventually, we begin to witness an increasing number of errors. It is at this point we know we have pushed the value of K too far.
In cases where we are taking a majority vote (e.g. picking the mode in a classification problem) among labels, we usually make K an odd number to have a tiebreaker."

6.3 Evaluation

6.3.1 Linear Regression

MSE - Mean Squared Error is one of the most used and simplest metrics; it is the mean absolute error with a small twist. The mean squared error is the average of the squared differences between the actual and predicted values. The lower it gets, the better the model. 1 - 0 for no outliers.

MAE - Mean Absolute Error is a very simple metric which calculates the mean absolute difference between the actual and predicted values. Again, the lower the better. 2 - 0 for no outliers.

RMSE - Root Mean Squared Error is, as the name says, simply the square root of the mean squared error. The output value is in the same unit as the target variable, which makes interpretation of the loss easy. 3 - 0 for no outliers.

R^2 - R Squared is a metric that tells how well the model performs rather than expressing the loss in an absolute sense. With the help of R squared I have a baseline to compare the model against, which none of the other metrics provides: it basically calculates how much better the regression line is than a simple mean line. This time a draw, 4 - 1 still for no outliers.
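
For reference, a sketch of how these four metrics could be computed with sklearn for both linear regression models, reusing the fitted models and splits assumed in the earlier sketches:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report(name, model, X_te, y_te):
    pred = model.predict(X_te)
    mse = mean_squared_error(y_te, pred)
    print(f"{name:15s} MSE={mse:.2f}  MAE={mean_absolute_error(y_te, pred):.2f}  "
          f"RMSE={np.sqrt(mse):.2f}  R2={r2_score(y_te, pred):.3f}")

report("with outliers", lr, X_test, y_test)
report("no outliers", lr_o, X_test_o, y_test_o)
```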


Overall the model created from features with no outliers performs much better.

6.3.2 k-Nearest Neighbors

The score of 33% is significantly better than the one from Iteration 0. I spent much more time on preprocessing and modelling, the results of which you see above.

Below I will comment on my previous evaluation adding points for this (improved) one.

The accuracy score of 0.19% is actually disappointing. In the next iteration I will investigate its cause and apply solutions.
I did not expect 100% accuracy; however, with this correlation I did expect somewhat more, and a model with this accuracy is rather useless.
It might be caused by a mistake somewhere in my process or by not choosing the best algorithm. Anyway, I will improve this project in further iterations and hopefully the accuracy score will improve with it.
Possible reasons for the low accuracy I see so far:

The second iteration brings a positive outcome to this project, with the accuracy increased to 33%, which indicates an improvement of the model. It has confirmed my initial thought that the goal of this project is not the accuracy itself but the whole process of walking through the methodology.
The algorithm I originally used for Iteration 0 was indeed suitable for the problem, but in the wrong 'settings': classification instead of regression, resulting in a completely opposite output. Lesson learned, algorithm fixed.
Iteration 1 also brings a new discovery: data with outliers removed suits the model better and results in higher accuracy. This had to be checked, and that is why Phases 2 & 3 are divided into 'Outliers' and 'No Outliers' sections. The mistake in feature selection, which was spotted by my teacher, is now repaired and causes no more errors in the chunks below.
Experiments were also part of this submission. With the MinMaxScaler method from the sklearn package I tried to normalize the values for the model. Unfortunately, I got a bit lost in the documentation, did not want to delay the submission and gave up on this idea.
In order to choose the best hyperparameters for the algorithm I used a script that does it for me and ensures the best results.
To summarize, I consider this iteration a major improvement on the initial project and am looking forward to the next ones.

Go back to Table of contents.



7. Conclusion

Iteration 0

Click to see my conclusion. Iteration 0 is a good start and try-out of this project. I have already dived into the topic and presented initial findings in this document.

Starting with domain understanding, I began to research the topic and discovered some insights which were previously unknown to me. It was a good opportunity to make sure that this topic suits me and aligns with my assumptions. I came up with research questions and some additional research points which I will extend in future iterations.

Afterwards, I moved into the data part of this project; finding a suitable database took some time and resulted in a small list of datasets that are also interesting for EDA. The collection of data requirements remains unfinished and will also be updated as the project moves forward.

The time had come to load the data and start experimenting. Loading and tidying took some time due to the high number of observations and features, but the good thing is that it only needs to be done once. The initial EDA went smoothly with no big surprises, and I had an opportunity to explore the dataset.

Next, I started the ML work with preprocessing and removing outliers to prepare my data for modelling.
Model creation and evaluation went smoothly, at least the process itself. Unfortunately, the accuracy score is really low for now, which both upsets me and makes me curious. However, this project is not about getting 100% accuracy but about the whole process of an AI project and the steps I take to achieve any result.

Summarizing, what have I learned or practiced:
- A lot of HTML tags and styles :)
- Research methods
- Stating meaningful questions
- Referencing sources
- Data loading
- Data cleaning
- Data plotting
- Generally a lot of Python syntax
- kNN ML algorithm
- Describing summary statistics
and many more.

Planning for next iteration:
- Extend domain understanding
- Plan interview with expert
- Continue EDA
- Use additional database
- Examine low accuracy score
- Try out different algorithm

Iteration 1

Click to see my conclusion. Iteration 1 is a decent improvement to this project. I have already gathered feedback after Iteration 0 and took the chance to enhance my work. I am definitely satisfied with what I have improved.

Starting with domain understanding, I continued to research the topic and discovered more interesting facts supported by data. The layout of the domain understanding changed a bit as I dived into the topic more and wanted to structure my findings. I also came up with the planning for an expert interview which will conclude my domain research.

Afterwards, I moved into the data part of this project and updated some parts. I have already chosen an additional dataset on which I want to perform additional EDA in the next phase.

The second big update was in the EDA part. I created many new plots aiming at one goal: to understand the data and the information better. I played around with various formats and colors, which improves the visual aspect of the EDA, which is also very important in my opinion.

Next, I returned to the ML part, which I was most curious about.
I decided to create two paths for my data to experiment with various scenarios: a set with and a set without outliers, to check on which I get better results. In the modelling part I also chose to try out different algorithms: kNN, which I repaired as I had used it in the wrong way for Iteration 0, and linear regression as a new one. For both of them I tried the two versions of the sets to see which one outputs better results.

Summarizing, what have I learned or practiced:
- (Again) A lot of HTML tags and styles :)
- (Again) Research methods
- (Again) Data plotting
- (Again) Generally a lot of Python syntax
- (Again) kNN ML algorithm for regression + linear regression

Initial planning for next iteration:
- Extend domain understanding with Interview
- Continue EDA on different dataset
- Try out different algorithm

Go back to Table of contents.



8. Appendix

Go back to Table of contents.



9. Feedback

This section includes all feedback I received on this project. The idea is to make it transparent and easily accessible.

Feedback Iteration 0 + Addressing it in Iteration 1

Click to see feedback.
- Machine Learning

Hi Andrzej, first off I believe there is a lot of good stuff in this document.
I also believe the document is highly verbose,
which results in it being very hard for me to read it and try to focus on the good stuff
.
There is no need to explain the theory in this document, you may assume that the reader is well versed in AI and understands the jargon.
I do like that you explain what steps you take and why, and that you try different things.
The fact that you rumble the data a little to see if you can influence the correlation heatmap gives me the impression that you know
what the heatmap means and what you are looking for.

However, at some point you removed the Target Variable from the heatmap and went into comparing features, but I do not understand why you did that. Reasons why your accuracy may be super low:
Maybe this is a regression problem and you used a classification algorithm?
Maybe you removed too many outliers? If you flatten the variance too much the algorithm has no way to distinguish one class from another.
Maybe you selected the wrong features. I believe you selected ['Page_Popularity', Wday_Nr'] are you sure that as the best combination?
For me this challenge is a GO, you can address my feedback in the next iterations.
- Michielsen, Bas B.S.H.T. , 15 Mar at 16:53

- No need to explain theory - Indeed, I removed most of theory explanation to save space in the doc.
- Document is highly verbose - I tried to limit some parts by introducing toggle buttons and made moving through doc easier by adding links to sections.
- You removed the Target Variable - I wanted to experiment but didn't mention it; now it's fixed.
- Maybe this is a regression problem - Haha, indeed, I fixed kNN to 'regression' settings.
- Maybe you removed too many outliers - I did check it and the no-outliers set gives a better score.
- Maybe you selected the wrong features - Perhaps, I stick to my initial plan, however I want to talk it over with you and see my options.
- address my feedback - I sure did :).

- Data analytics & Investigative analysis

Hi Andrzej
It is really nice that you made such an extensive data analysis. Nevertheless, I would prefer to see the results that are actually carrying valuable information regarding your project.
I really appreciate the amount of effort that you have put but maybe next time try to be more focused towards one goal.
- Pencheva, Sabina S. , 30 Mar at 22:10

- really nice that you made such an extensive data analysis - I try.
- results that are actually carrying valuable information - Indeed, I believe you will like my new plots which actually do so.
- try to be more focused towards one goal - I think I understand and tried to keep my EDA on point.
- References, from the meeting - fixed now.


- Societal Impact

Hey Andrzej, I just took a look at your first version of Iteration Zero.
Good work so far. Some remarks:
Who is your (fictional) client? (it is best to know for whom you are building and writing this,
especially for how you visualize and report on his in the Delivery Phase -> good for learning outcome "Targeted interaction").
What would be the added value for your prediction for this (fictional) client? Why do this?
Is there a way to interview an expert on this subject (Nick Welman will give a session surrounding this this afternoon for SI).
Maybe take a short time to consider the additional impact your project could have on people or society (could be positive or negative impact).
But overall I think you are already doing a good job, so keep that up!
If you would like to speak with me online still (maybe now, or maybe after doing something with the feedback) we keep schedule something on Wednesdays, Thursdays or Fridays. Take care!
- [16/03/2022 11:57] Bloks,Danny D. Hey Andrzej, As mentioned this is a fitting project, and the most important SI feedback relates to who you (fictional) client is and
why your project could be beneficial for them (besides the smaller feedback I gave).
One thing I forgot to mention is that there are quite a few typos, so maybe give a document a "readover" before uploading it.
Good luck!
- Bloks, Danny D. , 16 Mar at 12:00

- Who is your (fictional) client? - I added this section.
- What would be the added value - Also added in this Iteration.
- Way to interview an expert on this subject - There is; I am scheduling it right now and have already created a plan (in this doc).
- consider the additional impact your project could have on people or society - Added in this iteration.
- there are quite a few typos - Yeah, speed writing in markdown... I equipped myself with spell-check software - any blame goes to it from now on :)

Feedback Iteration 1

Click to see feedback.
- Machine Learning

Hi Andrzej I am generally positive about the work that you demonstrate here. However, try not to 'overdo' it. I can see that you know what you are doing, the structure of this notebook is excellent and therefore I can see that I am looking at a piece of quality work, but the flow of the document is chaotic and spikey. If reading this document could be compared to a ride in a theme park, you have built a rollercoaster, whereas I would have liked those little boats that go into different lands and gently show me nice things. It seems like you want to do TOO MUCH in one go. Just pick one thing, do that, and if you do not like the result, make a new chapter phase2/3 and try another thing. Here there is linear regression with outliers and without outliers at the same time, and then comes kNN also with outlier and without and then in evaluation we go back to linear regression and after again kNN, up down, left right, up down weeeeeee. Generally it is so that if you have a defensible strategy for removing outliers, you do not need to run ML any more on the dataset with the outliers because without the outliers almost always performs better. In 6.1.2 you say which features you selected, but not WHY. Given that feature selection is defining for the result, I would have liked to read your reasoning. In 6.3.2 you seem to come back to a previous conclusion and have new additions to it this time. I do not disagree with the idea of referring back to a previous find, but the way you set up this chapter results in clunky reading. If I read this section out loud, I fee like a robot. 😁 Perhaps you could write it as a sort of 'discussion of findings'? All in all I like what you are doing, see if you could make smaller iterations with clearer defined steps that are not multiple steps intertwined, and perhaps give yourself the time to reason on the results in a discussion like humans would do (taken the domain understanding into consideration).
- Michielsen, Bas B.S.H.T. , 11 Apr at 16:19

- Data analytics & Investigative analysis

Hi Andrzej, I really appreciate the effort that you put in the document, good job! What I would encourage you to do is to try to focus on the why part, for example, when you talk about normalising you literally explained what is the concept behind it but you didn't say why are you actually using it.
- Pencheva, Sabina S. , 11 Apr at 11:34


- Societal Impact

Hey Andrzej, Well written Proposal in this notebook! You tick all the boxes to proof you can apply the needed learning outcome in a Proposal. Did you find an expert to interview already? Go on to the other phases! Good luck!
- Bloks, Danny D. , 5 Apr at 14:40

Go back to Table of contents.



10. References

Shanika Wickramasinghe (2021). Bias & Variance in Machine Learning: Concepts & Tutorials. Retrieved 03:15, March 6, 2022, from
https://www.bmc.com/blogs/bias-variance-machine-learning/#:~:text=What%20is%20bias%20in%20machine,assumptions%20in%20the%20ML%20process

Wikipedia contributors. (2022, February 24). K-nearest neighbors algorithm. In Wikipedia, The Free Encyclopedia. Retrieved 09:06, March 14, 2022, from
https://en.wikipedia.org/w/index.php?title=K-nearest_neighbors_algorithm&oldid=1073696056

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml].
Irvine, CA: University of California, School of Information and Computer Science.

Moro, S., et al., Predicting social media performance metrics and evaluation of the impact on brand building:
A data mining approach, Journal of Business Research (2016), http://dx.doi.org/10.1016/j.jbusres.2016.02.010

Y. Zhao, Y. Zhang, Comparison of decision tree methods for finding active objects, Advances in Space Research 41 (12) (2008) 1955–1959.

Kamaljot Singh & Ranjeet Kaur (2015). Comment Volume Prediction using Neural Networks and Decision Trees.
Retrieved 11:32, March 3, 2022, from
https://www.researchgate.net/profile/Kamaljot-Singh-2/publication/301284745_Comment_Volume_Prediction_using_Neural_Networks_and_Decision_Trees/links/570f3ce808aecd31ec9a95bf/Comment-Volume-Prediction-using-Neural-Networks-and-Decision-Trees.pdf

Onel Harrison (Sep 10, 2018). Machine Learning Basics with the K-Nearest Neighbors Algorithm.
Retrieved March 14, 2022, from
https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761

Aishwarya Singh. A Practical Introduction to K-Nearest Neighbors Algorithm for Regression.
Retrieved March 30, 2022, from
https://www.analyticsvidhya.com/blog/2018/08/k-nearest-neighbor-introduction-regression-python/

Raghav Agrawal (May 19, 2021). Know The Best Evaluation Metrics for Your Regression Model! Retrieved March 30, 2022, from
https://www.analyticsvidhya.com/blog/2021/05/know-the-best-evaluation-metrics-for-your-regression-model/

Bonestroo, W.J., Meesters, M., Niels, R., Schagen, J.D., Henneke, L., Turnhout, K. van (2018): ICT Research Methods. HBO-i, Amsterdam. ISBN/EAN: 9990002067426.
Available from: http://www.ictresearchmethods.nl/

Number of monthly active Facebook users worldwide as of 4th quarter 2021. Statista.com
Retrieved March 31, 2022, from
https://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/

Visobe, & Mohamud. (2019, July 12). Why it's important to standardize your data - atlan: Humans of data. Atlan.
Retrieved April 12, 2022, from https://humansofdata.atlan.com/2018/12/data-standardization/

Go back to Table of contents.


Iteration 2

In this Iteration I want to focus on extending Phase 3 by introducing a new algorithm and trying out different features. Additionally, I will create a first version of the delivery Phase 4, which will later be improved in the final Iteration.



6. Predictions (Phase 3)

A new algorithm I want to try out is the Support Vector Machine. Its original purpose was classification, however there is also a regression variant.
This time I am going to dive into the documentation much more than I did for the Iteration 1 kNN, which resulted in a classification model predicting a continuous variable.

Looking at the heat map from previous iterations, I noticed a relatively high (for this data) correlation between 'Page_Popularity' and 'Page_Talking_About'.
My target variable stays the same for the whole project - 'Target_Post_Share_Count' - and I want to switch 'Wday_Nr' to 'Page_Talking_About'.

Additionally, SVMs, despite being complicated algorithms, offer challenging visualization opportunities which I plan to take up.

This Iteration is intended to have a different structure from the previous one (no more rollercoaster) and will be based on only one set of features, with no outliers, as that kind of set has already proven to give better model performance.


As I will be using some new libraries, I load them here (Iteration 2) to have easy access and avoid scrolling through the whole document.
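
A sketch of the imports I rely on in this Iteration (the exact set in the notebook may differ slightly):

```python
# Libraries used in Iteration 2 (sketch)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
```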

6.1 Preprocessing

I start with creating a fresh copy of original dataset with selected features.

6.1.1 Removing outliers

In the next step I remove the outliers to stay with pure data, with no exceptions from the majority. I do it manually, as I already created a working script for it. However, I am aware of ready-made functions that could do it more easily. Perhaps I will try to use them in the future.
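
My own script is not repeated here; as a hypothetical sketch of the same idea, an IQR-based filter could look like this (the function name and the k factor are assumptions):

```python
import pandas as pd

# Hypothetical IQR-based variant of my manual outlier removal;
# rows outside [Q1 - k*IQR, Q3 + k*IQR] in any listed column are dropped.
def remove_outliers_iqr(df: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# df_clean = remove_outliers_iqr(df, ['Page_Popularity', 'Page_Talking_About', 'Target_Post_Share_Count'])
```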

6.1.2 Scaling features

In the previous iteration I used standardization to scale the features. Now I want to try the most used technique in the ML industry -> Min-Max normalization.

There are two main reasons that support the need for scaling: distance-based algorithms such as kNN and SVM are sensitive to the magnitude of the features, and features on a common scale let the optimization converge faster and make the results easier to compare.
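
A minimal sketch of the Min-Max normalization, assuming the outlier-free DataFrame from the previous step is called df_clean (an assumption):

```python
from sklearn.preprocessing import MinMaxScaler

features = ['Page_Popularity', 'Page_Talking_About', 'Target_Post_Share_Count']

# Scale every selected column to the [0, 1] range
scaler = MinMaxScaler()
df_scaled = df_clean.copy()
df_scaled[features] = scaler.fit_transform(df_clean[features])
```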

6.1.3 Selecting features

In this step I select the features for my model. They are already chosen and explained in previous points. Additionally, I plot two features to visualize their correlation.
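
A minimal sketch of the selection and the plot, assuming the scaled DataFrame from the previous step is called df_scaled:

```python
import matplotlib.pyplot as plt

# Features (X) and target (y) for the models
X = df_scaled[['Page_Popularity', 'Page_Talking_About']]
y = df_scaled['Target_Post_Share_Count']

# Visualize the relationship between the two selected features
plt.figure(figsize=(6, 6))
plt.scatter(X['Page_Popularity'], X['Page_Talking_About'], s=5, alpha=0.3)
plt.xlabel('Page_Popularity (scaled)')
plt.ylabel('Page_Talking_About (scaled)')
plt.title('Selected features after Min-Max scaling')
plt.show()
```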

It looks like a lot of values are concentrated in the 0-0.2 area and slowly spread out as x and y increase.
However, the scale is tricky and should really be more of a square shape.

6.1.4 Dividing data into a training and test set

I am using the train_test_split() function to divide the dataset into training and testing pieces. I choose a test size of 0.3, as the correlation is still pretty low and I want to sacrifice more data for training.
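
A sketch of the split, continuing from the feature selection above (the fixed random_state is an assumption I use for reproducibility):

```python
from sklearn.model_selection import train_test_split

# 70% of the rows go to training, 30% to testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```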

6.2 Modelling

In this part I create two models in parallel: a linear regression and an SVM. I explain all hyperparameters below the model creation.

The above chunk took over 6 hours to compute and resulted in C=200 and epsilon=5.
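
The tuning cell itself lives in the notebook; purely as an illustration, such a search over C and epsilon could be set up like this (the grid values below are assumptions, not the ones I actually searched):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Illustrative grid only; the real search ran in the notebook and ended at C=200, epsilon=5
param_grid = {'C': [1, 10, 50, 100, 200],
              'epsilon': [0.1, 1, 5, 10]}

grid = GridSearchCV(SVR(kernel='rbf'), param_grid,
                    scoring='neg_root_mean_squared_error', cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)
```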

kernel='rbf'
kernel{‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’} or callable, default=’rbf’
RBF kernels are the most generalized form of kernelization and one of the most widely used kernels due to their similarity to the Gaussian distribution. The RBF kernel function for two points X₁ and X₂ computes their similarity, or how close they are to each other.
"RBF Kernel is popular because of its similarity to K-Nearest Neighborhood Algorithm. It has the advantages of K-NN and overcomes the space complexity problem as RBF Kernel Support Vector Machines just needs to store the support vectors during training and not the entire dataset."

C=200
The C parameter tells the SVM optimization how much I want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points. For very tiny values of C, I should get misclassified examples, often even if my training data is linearly separable.

epsilon=5
From documentation: "Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value."
To be honest, this hyperparameter is a bit vague to me. However, I know one thing for sure: the larger ϵ is, the larger the errors I admit in my solution, and for now that is enough.
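
Putting these hyperparameters together, a minimal sketch of the two models side by side (variable names follow the earlier sketches and are assumptions):

```python
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# The two models, trained on the same training data
lin_reg = LinearRegression().fit(X_train, y_train)
svr = SVR(kernel='rbf', C=200, epsilon=5).fit(X_train, y_train)
```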

6.2.1 Visualization - Linear Regression Prediction Surface

I take up the challenge of 3D visualization. While there should be no problem with linear regression, the SVM will be much more advanced.

After a decent amount of time I managed to find tutorials on how to create this 'layer' of the model, which is apparently called a plane. As it turns out, creating the 3D plot is super easy; creating the plane is super complex.
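
A minimal sketch of the idea, assuming Plotly (which I also reference for the colorscales) and the variable names from the earlier sketches: build a grid over the two scaled features, predict on every grid point, and draw the result as a surface over the test points. Swapping lin_reg for svr gives the SVR surface in the next section.

```python
import numpy as np
import plotly.graph_objects as go

# Predictions over a regular grid form the plane/surface of the model
xx, yy = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
zz = lin_reg.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)  # use svr for the SVR surface

fig = go.Figure(data=[
    go.Surface(x=xx, y=yy, z=zz, opacity=0.7),
    go.Scatter3d(x=X_test['Page_Popularity'], y=X_test['Page_Talking_About'],
                 z=y_test, mode='markers', marker=dict(size=2)),
])
fig.update_layout(scene_xaxis_title='Page_Popularity',
                  scene_yaxis_title='Page_Talking_About',
                  scene_zaxis_title='Target_Post_Share_Count')
fig.show()
```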

6.2.2 Visualization - SVR Prediction Surface

WOW, that looks sophisticated and professional.

I think those two plots perfectly visualize why an SVM algorithm is better.
Multiple linear regression is a flat plane that tries to fit all data points with a single cut. SVR is more flexible and soft: it can bend and fold in whatever way fits the data better. This enables me to get a more accurate model.

6.3 Evaluation

Speaking of accuracy, I will now evaluate the performance of SVM and see if it brings any improvement to my project.

6.3.1 Support Vector Machine
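
For reference, a sketch of how the two numbers discussed below are obtained (variable names follow the earlier sketches):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_pred = svr.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
score = r2_score(y_test, y_pred)   # the same value as svr.score(X_test, y_test)
print(f'RMSE: {rmse:.4f}   Score (R²): {score:.4f}')
```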

Well, the RMSE is higher than I expected and the score is lower than I expected.
For now, this is the worst performing model in this project; the kNN did a much better job. After playing with the hyperparameters I noticed that they highly influence those numbers (FYI: I knew that way before, obviously) and changing them improves the score. There must be some convenient way of choosing the best ones, like I did for kNN in the previous iteration. I will have to research it.

At this moment I have no clue how to improve this model other than hyperparameter tuning. I am still an SVM rookie and this was the first time I used this algorithm. I will seek improvements after receiving feedback from an ML expert.

Go back to Table of contents.



7. Delivery (Phase 4)

Delivery is the last phase of every AI project and focuses on deploying the solution and reporting. The key element is to put my model to the test by demonstrating it to my stakeholder. The delivery phase is complete when the feedback from my stakeholders is incorporated into a final submission.

I will start with model selection, choosing 1 out of the 3 models I have created during this project. Based on the evaluation of each one, the best algorithm will be chosen for deployment.
The next step is to create a fully working AI prototype which takes user input and outputs the predicted value.
Once the application is created, it will be sent for field testing to the project stakeholder, who will provide me with feedback and ideas for improvements.
After incorporating those into the prototype, I can move to the next step - Collecting and Documenting - in which I will gather the most important information about the application.
The last milestone is to present the final product to the stakeholder again and provide him with a project report.

After that, the project is marked as complete.

7.1 Model selection

During this project, I have produced 3 different Machine Learning models aiming to predict Facebook post share volume based on selected features. Experimenting with different algorithms gives me the opportunity to choose the best performing one and implement it in a prototype.

List of available models:
- k-Nearest Neighbors (regressor)
- Linear Regression
- Support Vector Regression (SVR)

As all models are created using regression algorithms and predict a continuous variable, model selection based on accuracy score (a percentage) is not possible, as that applies only to classification problems. That is why the selection is performed on the following scores:
- RMSE (Root Mean Square Error) - the lower the better
- Score (R²) - the higher the better

Every model was trained and tested on the same dataset and with the same train/test split. Additionally, a grid search was performed for every model, ensuring the most optimal hyperparameter values influencing its performance.


The best performing model with both the lowest RMSE and the highest Score, outperforming other models in numbers, is:

k-Nearest Neighbors (Regressor)

which is going to be deployed and put into a field testing.

7.2 Model deployment

In this part I create a fully working AI prototype in the form of a web application. The product has the following requirements:

You can try out this prototype here!
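
The prototype itself is linked above; purely as an illustration of the idea, a minimal web app could look like the sketch below. The framework (Streamlit), the pickle file name and the raw-input handling are assumptions, not the deployed code.

```python
# app.py - hypothetical minimal prototype, not the actual deployed code
import pickle
import streamlit as st

model = pickle.load(open('knn_share_model.pkl', 'rb'))  # the trained kNN regressor (assumed file name)

st.title('Facebook post share volume predictor')
likes = st.number_input('Mother page like count', min_value=0)
weekday = st.selectbox('Weekday of publishing', list(range(1, 8)))

if st.button('Predict'):
    # In practice the inputs must first be scaled exactly like the training data
    shares = model.predict([[likes, weekday]])[0]
    st.write(f'Predicted share count: {shares:.0f}')
```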

7.3 Application field testing

7.4 Collecting & Documenting

7.5 Presentation & Reporting

Go back to Table of contents.



8. Conclusion

Iteration 2

Click to see my conclusion.

Go back to Table of contents.



9. Feedback

This section includes all feedback I received on this project. The idea is to make it transparent and easily accessible.

Feedback Iteration 2 + Addressing it in Iteration 3 (Placeholder)

Click to see feedback.
- Machine Learning


- Data analytics & Investigative analysis



- Societal Impact


Go back to Table of contents.



10. References

Need of feature scaling in machine learning. (n.d.).
Retrieved April 13, 2022, from https://www.enjoyalgorithms.com/blog/need-of-feature-scaling-in-machine-learning

Dobilas, S. (2022, February 12). Support vector regression (SVR) - one of the most flexible yet robust prediction algorithms. Medium.
Retrieved April 13, 2022, from https://towardsdatascience.com/support-vector-regression-svr-one-of-the-most-flexible-yet-robust-prediction-algorithms-4d25fbdaca60

Sreenivasa, S. (2020, October 12). Radial basis function (RBF) kernel: The go-to kernel. Medium.
Retrieved April 13, 2022, from https://towardsdatascience.com/radial-basis-function-rbf-kernel-the-go-to-kernel-acf0d22c798a

Built-in continuous color scales with Python. Plotly. (n.d.).
Retrieved April 13, 2022, from https://plotly.com/python/builtin-colorscales/

Go back to Table of contents.


Created by Andrzej Krasnodebski, 2022. Fontys University of Applied Sciences.